Unbiased learning to rank (ULTR) studies the problem of mitigating various biases in implicit user feedback data such as clicks, and has been receiving considerable attention recently. A popular ULTR approach for real-world applications uses a two-tower architecture, where click modeling is factorized into a relevance tower with regular input features and a bias tower with bias-relevant inputs such as the position of a document. A successful factorization allows the relevance tower to be exempt from biases. In this work, we identify a critical issue that existing ULTR methods have ignored: the bias tower can be confounded with the relevance tower via the underlying true relevance. In particular, the positions were determined by the logging policy, i.e., the previous production model, which possesses relevance information. We give both theoretical analysis and empirical results to show the negative effects on the relevance tower due to such a correlation. We then propose three methods to mitigate these negative confounding effects by better disentangling relevance and bias. Empirical results on both controlled public datasets and a large-scale industry dataset show the effectiveness of the proposed approaches.
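As a rough illustration of the two-tower factorization described above, the following sketch combines a relevance tower over regular features with a position-only bias tower, additively in logit space. This is a minimal PyTorch sketch under assumed tower sizes and feature shapes, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TwoTowerClickModel(nn.Module):
    def __init__(self, relevance_dim, num_positions, hidden=64):
        super().__init__()
        # Relevance tower: regular query/document features.
        self.relevance_tower = nn.Sequential(
            nn.Linear(relevance_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        # Bias tower: bias-relevant inputs, here only the display position.
        self.bias_tower = nn.Embedding(num_positions, 1)

    def forward(self, features, positions):
        relevance_logit = self.relevance_tower(features).squeeze(-1)
        bias_logit = self.bias_tower(positions).squeeze(-1)
        # Additive factorization in logit space: click = relevance + bias.
        return relevance_logit + bias_logit

model = TwoTowerClickModel(relevance_dim=32, num_positions=50)
features = torch.randn(8, 32)            # toy document features
positions = torch.randint(0, 50, (8,))   # logged display positions
clicks = torch.randint(0, 2, (8,)).float()
loss = nn.functional.binary_cross_entropy_with_logits(
    model(features, positions), clicks
)
```

The confounding issue the paper identifies arises precisely because the logged positions were assigned by a relevance-aware policy, so the two towers' inputs are not independent.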
Differentiable Search Indices (DSIs) encode a corpus of documents in the parameters of a model and use the same model to map queries directly to relevant document identifiers. Despite the strong performance of DSI models, deploying them in situations where the corpus changes over time is computationally expensive because reindexing the corpus requires re-training the model. In this work, we introduce DSI++, a continual learning challenge for DSI to incrementally index new documents while being able to answer queries related to both previously and newly indexed documents. Across different model scales and document identifier representations, we show that continual indexing of new documents leads to considerable forgetting of previously indexed documents. We also hypothesize and verify that the model experiences forgetting events during training, leading to unstable learning. To mitigate these issues, we investigate two approaches. The first focuses on modifying the training dynamics. Flatter minima implicitly alleviate forgetting, so we optimize for flatter loss basins and show that the model stably memorizes more documents (+12\%). Next, we introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task. Extensive experiments on novel continual indexing benchmarks based on Natural Questions (NQ) and MS MARCO demonstrate that our proposed solution mitigates forgetting by a significant margin. Concretely, it improves the average Hits@10 by $+21.1\%$ over competitive baselines for NQ and requires $6$ times fewer model updates compared to re-training the DSI model for incrementally indexing five corpora in a sequence.
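The generative-memory idea above can be sketched as follows: a seq2seq model samples pseudo-queries for already-indexed documents, and these (pseudo-query, docid) pairs are replayed during continual indexing. The t5-small checkpoint below is a stand-in for illustration (a trained doc2query-style generator would be used in practice), not the paper's exact setup.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# t5-small is an illustrative stand-in; a doc2query-style query generator
# would be used in practice.
tok = AutoTokenizer.from_pretrained("t5-small")
qgen = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def pseudo_queries(document_text, n=3):
    """Sample n pseudo-queries for a previously indexed document."""
    inputs = tok(document_text, return_tensors="pt", truncation=True)
    outs = qgen.generate(**inputs, do_sample=True, top_k=50,
                         num_return_sequences=n, max_new_tokens=32)
    return [tok.decode(o, skip_special_tokens=True) for o in outs]

def replay_pairs(old_docs, docids):
    """(pseudo-query -> docid) pairs to interleave with new-document
    indexing examples, so old retrieval behavior is rehearsed."""
    return [(q, d) for doc, d in zip(old_docs, docids)
            for q in pseudo_queries(doc)]
```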
Automating information extraction from form-like documents is a pressing need, owing to its potential impact on automating business workflows across many industries such as financial services, insurance, and healthcare. The key challenge is that form-like documents in these business workflows can be laid out in virtually infinite ways; hence, a good solution to this problem should generalize to documents with unseen layouts and languages. A solution requires a comprehensive understanding of both the textual segments and the visual cues in a document, which is non-trivial. While the natural language processing and computer vision communities have begun to tackle this problem, there has been little focus on (1) data efficiency and (2) the ability to generalize across different document types and languages. In this paper, we show that when we have only a small number of labeled training documents (~50), a simple transfer-learning approach from a considerably larger corpus of structurally different labeled documents yields an improvement of up to 27 F1 points over simply training on the small corpus in the target domain. We improve on this further with a simple multi-domain transfer-learning approach, currently in production use, and show that it achieves an improvement of 8 F1 points. We make the case that data efficiency is critical for information extraction systems to scale to handle hundreds of different document types, and that learning good representations is essential to achieving this goal.
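A minimal, self-contained sketch of the two-step transfer recipe argued for above: pretrain on a large source corpus, then fine-tune the same weights on a small (~50-document) target corpus. The tiny model and random tensors are illustrative stand-ins for a real extraction model and document data.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()

def train(loader, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# Toy stand-ins for a large source corpus and a small target corpus.
source = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(100)]
target = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(6)]

train(source, epochs=3, lr=1e-3)   # step 1: pretrain on the source domain
train(target, epochs=10, lr=1e-4)  # step 2: fine-tune on the target domain
```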
Multiclass classification (MCC) is a fundamental machine learning problem that aims to classify each instance into one of a predefined set of classes. Given an instance, a classification model computes a score for each class, and all the scores are then used to rank the classes. The performance of a classification model is usually measured by top-K accuracy/error (e.g., K = 1 or 5). In this paper, we do not aim to propose new neural representation learning models, as most recent works do, but to show that MCC performance can easily be improved through a ranking lens. In particular, by viewing MCC as ranking the classes for an instance, we first argue that ranking metrics, such as Normalized Discounted Cumulative Gain (NDCG), can be more informative than the existing top-K metrics. We further demonstrate that the dominant neural MCC architectures can be formulated as a neural ranking framework with a specific set of design choices. Based on this generalization, we show that it is straightforward and intuitive to leverage techniques from the rich information retrieval literature to improve MCC performance out of the box. Extensive empirical results on text and image classification tasks with diverse datasets and backbone models (e.g., BERT and ResNet for text and image classification) show the value of our proposed framework.
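The ranking lens above can be made concrete with a small example: treat the per-class scores for one instance as a ranking and evaluate it with NDCG, where the single true class is the only relevant item. This is a minimal sketch of the metric, with the binary-relevance setup as a simplifying assumption.

```python
import numpy as np

def ndcg_at_k(scores, label, k=5):
    """NDCG@k for one instance: `scores` ranks all classes, `label` is the
    single relevant class, so the ideal DCG is 1 (relevant item at rank 1)."""
    ranked = np.argsort(-scores)[:k]        # classes in decreasing score order
    hits = np.where(ranked == label)[0]
    if hits.size == 0:
        return 0.0
    return 1.0 / np.log2(hits[0] + 2)       # gain 1 at rank hits[0] + 1

scores = np.array([0.1, 2.3, 0.7, 1.9])    # model scores for 4 classes
print(ndcg_at_k(scores, label=3))          # true class ranked 2nd -> ~0.63
```

Unlike top-1 accuracy, which would score this instance 0, NDCG still rewards the model for ranking the true class near the top.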
Partial differential equations (PDEs) are important tools to model physical systems, and including them in machine learning models is an important way of incorporating physical knowledge. Given any system of linear PDEs with constant coefficients, we propose a family of Gaussian process (GP) priors, which we call EPGP, such that all realizations are exact solutions of this system. We apply the Ehrenpreis-Palamodov fundamental principle, which works like a non-linear Fourier transform, to construct GP kernels mirroring standard spectral methods for GPs. Our approach can infer probable solutions of linear PDE systems from any data, such as noisy measurements or initial and boundary conditions. Constructing EPGP priors is algorithmic, generally applicable, and comes with a sparse version (S-EPGP) that learns the relevant spectral frequencies and works better for big data sets. We demonstrate our approach on three families of PDE systems, the heat equation, the wave equation, and Maxwell's equations, where we improve upon the state of the art in computation time and precision, in some experiments by several orders of magnitude.
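To make the spectral construction concrete, here is a minimal sketch for the 1D heat equation u_t = u_xx: every feature exp(i·ξx − ξ²t) solves the PDE exactly (u_t = −ξ²u = u_xx), so any random combination of such features is itself an exact solution. The frequency sampling and weights below are illustrative, not the EPGP algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(0)
xis = rng.normal(size=64)                            # sampled spectral frequencies
w = rng.normal(size=64) + 1j * rng.normal(size=64)   # random complex weights

def sample_solution(x, t):
    """One draw from the prior: a finite sum of exact heat-equation
    solutions exp(i*xi*x - xi**2 * t), whose real part also solves the PDE."""
    phi = np.exp(1j * np.outer(x, xis) - np.outer(t, xis**2))
    return (phi @ w).real

x = np.linspace(0.0, 1.0, 5)
t = np.full(5, 0.1)
print(sample_solution(x, t))   # evaluates the random exact solution at (x, t)
```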
G-Enum histograms are a new fast and fully automated method for irregular histogram construction. By framing histogram construction as a density estimation problem and its automation as a model selection task, these histograms leverage the Minimum Description Length principle (MDL) to derive two different model selection criteria. Several proven theoretical results about these criteria give insights into their asymptotic behavior and are used to speed up their optimisation. These insights, combined with a greedy search heuristic, are used to construct histograms in linearithmic time rather than the polynomial time incurred by previous works. The capabilities of the proposed MDL density estimation method are illustrated with reference to other fully automated methods in the literature, on both synthetic and large real-world data sets.
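A minimal sketch of a greedy construction in this spirit: start from fine equal-frequency bins and repeatedly merge the adjacent pair that most improves a penalized log-likelihood. The simple BIC-style penalty and the naive merge scan below are illustrative stand-ins; the paper's MDL criteria and linearithmic optimization are more refined.

```python
import numpy as np

def score(counts, widths, n):
    """Penalized log-likelihood of a histogram density (BIC-style stand-in
    for the MDL criteria)."""
    dens = counts / (n * np.maximum(widths, 1e-12))
    loglik = np.sum(counts * np.log(np.maximum(dens, 1e-12)))
    return loglik - 0.5 * len(counts) * np.log(n)   # penalize bin count

def greedy_histogram(x, start_bins=64):
    n = len(x)
    edges = np.unique(np.quantile(x, np.linspace(0, 1, start_bins + 1)))
    counts, edges = np.histogram(x, bins=edges)
    while len(counts) > 1:
        best, best_s = None, score(counts, np.diff(edges), n)
        for i in range(len(counts) - 1):            # try each adjacent merge
            c = np.concatenate([counts[:i], [counts[i] + counts[i + 1]],
                                counts[i + 2:]])
            e = np.delete(edges, i + 1)
            s = score(c, np.diff(e), n)
            if s > best_s:
                best, best_s = (c, e), s
        if best is None:                            # no merge improves: stop
            break
        counts, edges = best
    return counts, edges

counts, edges = greedy_histogram(np.random.default_rng(1).normal(size=2000))
```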
Neural Radiance Fields (NeRFs) are emerging as a ubiquitous scene representation that allows for novel view synthesis. Increasingly, NeRFs will be shareable with other people. Before sharing a NeRF, though, it might be desirable to remove personal information or unsightly objects. Such removal is not easily achieved with the current NeRF editing frameworks. We propose a framework to remove objects from a NeRF representation created from an RGB-D sequence. Our NeRF inpainting method leverages recent work in 2D image inpainting and is guided by a user-provided mask. Our algorithm is underpinned by a confidence based view selection procedure. It chooses which of the individual 2D inpainted images to use in the creation of the NeRF, so that the resulting inpainted NeRF is 3D consistent. We show that our method for NeRF editing is effective for synthesizing plausible inpaintings in a multi-view coherent manner. We validate our approach using a new and still-challenging dataset for the task of NeRF inpainting.
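Under a strong simplifying assumption, confidence-based view selection can be sketched as follows: score each candidate 2D inpainting by its photometric agreement with a rendering of the same view inside the user mask, and keep only the most consistent ones for NeRF training. This is an illustrative stand-in, not the paper's actual procedure.

```python
import numpy as np

def confidence(inpainted, rendered, mask):
    """Negative mean photometric error inside the masked (inpainted) region."""
    return -float(np.abs(inpainted - rendered)[mask].mean())

def select_views(inpaintings, renderings, masks, keep_frac=0.5):
    confs = [confidence(i, r, m)
             for i, r, m in zip(inpaintings, renderings, masks)]
    order = np.argsort(confs)[::-1]                  # most confident first
    k = max(1, int(keep_frac * len(inpaintings)))
    return [inpaintings[i] for i in order[:k]]

rng = np.random.default_rng(0)
views = [rng.random((4, 4, 3)) for _ in range(6)]            # toy inpaintings
renders = [v + 0.1 * rng.random((4, 4, 3)) for v in views]   # toy renderings
masks = [rng.random((4, 4)) > 0.5 for _ in range(6)]         # toy user masks
kept = select_views(views, renders, masks)
```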
Co-clustering is a class of unsupervised data analysis techniques that extract the underlying dependency structure between the instances and variables of a data table as homogeneous blocks. Most of these techniques are limited to variables of the same type. In this paper, we propose a mixed-data co-clustering method based on a two-step methodology. In the first step, all the variables are binarized according to a number of bins chosen by the analyst: by equal-frequency discretization in the numerical case, or by keeping the most frequent values in the categorical case. The second step applies co-clustering to the instances and the binary variables, leading to groups of instances and groups of variable parts. We apply this methodology to several data sets and compare the results with those of a Multiple Correspondence Analysis applied to the same data.
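A minimal pandas sketch of the first (binarization) step described above, with the number of bins chosen by the analyst; the columns and data values are toy examples.

```python
import pandas as pd

def binarize(df, bins=4):
    parts = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Equal-frequency discretization into `bins` intervals.
            binned = pd.qcut(df[col], q=bins, duplicates="drop")
        else:
            # Keep the `bins` most frequent categories, group the rest.
            top = df[col].value_counts().index[:bins]
            binned = df[col].where(df[col].isin(top), other="__other__")
        # One binary indicator column per variable part.
        parts.append(pd.get_dummies(binned, prefix=col))
    return pd.concat(parts, axis=1).astype(int)

df = pd.DataFrame({"age": [23, 35, 41, 58, 62, 30],
                   "city": ["a", "b", "a", "c", "b", "a"]})
print(binarize(df, bins=2))
```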
Co-clustering is a data mining technique used to extract the underlying block structure between the rows and columns of a data matrix. Many approaches have been studied and have shown their capacity to extract such structures in continuous, binary or contingency tables. However, very little work has been done to perform co-clustering on mixed-type data. In this article, we extend co-clustering based on latent block models to the case of mixed data (continuous and binary variables). We then evaluate the effectiveness of the proposed approach on simulated data and discuss its advantages and potential limits.
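As a rough illustration of the mixed-data extension, the sketch below evaluates a block-model log-likelihood for fixed row and column partitions, with Gaussian blocks for continuous variables and Bernoulli blocks for binary ones. Inferring the partitions (e.g., via a variational EM) is the actual co-clustering step and is omitted here; assuming column clusters do not mix variable types is a simplification.

```python
import numpy as np
from scipy.stats import bernoulli, norm

def block_loglik(X, row_z, col_z, is_binary):
    """Log-likelihood of the data under per-block distributions, for fixed
    row clusters `row_z` and column clusters `col_z`."""
    ll = 0.0
    for g in np.unique(row_z):
        for h in np.unique(col_z):
            block = X[np.ix_(row_z == g, col_z == h)].ravel()
            if is_binary[col_z == h].all():          # Bernoulli block
                p = np.clip(block.mean(), 1e-6, 1 - 1e-6)
                ll += bernoulli.logpmf(block.astype(int), p).sum()
            else:                                    # Gaussian block
                mu, sd = block.mean(), block.std() + 1e-6
                ll += norm.logpdf(block, mu, sd).sum()
    return ll

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(20, 3)),
               rng.integers(0, 2, size=(20, 2))]).astype(float)
is_binary = np.array([False, False, False, True, True])
row_z = rng.integers(0, 2, size=20)
col_z = np.array([0, 0, 0, 1, 1])   # column clusters do not mix types here
print(block_loglik(X, row_z, col_z, is_binary))
```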
We explore the abilities of two machine learning approaches for no-arbitrage interpolation of European vanilla option prices, which jointly yield the corresponding local volatility surface: a finite-dimensional Gaussian process (GP) regression approach under no-arbitrage constraints based on prices, and a neural net (NN) approach with penalization of arbitrages based on implied volatilities. We demonstrate the performance of these approaches relative to the SSVI industry standard. The GP approach is proven arbitrage-free, whereas arbitrages are only penalized under the SSVI and NN approaches. The GP approach obtains the best out-of-sample calibration error and provides uncertainty quantification. The NN approach yields a smoother local volatility and a better backtesting performance, as its training criterion incorporates a local volatility regularization term.
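Arbitrage penalization of this kind can be sketched via static no-arbitrage constraints on a call-price surface C(K, T), namely non-increasing and convex in strike and non-decreasing in maturity, enforced as hinge penalties on finite differences. The network and grids below are illustrative assumptions, and note the paper's NN approach penalizes arbitrage in implied-volatility terms rather than directly on prices as done here.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))

def arbitrage_penalty(K, T):
    KT = torch.cartesian_prod(K, T)            # all (strike, maturity) pairs
    C = net(KT).reshape(len(K), len(T))        # C[i, j] = C(K[i], T[j])
    dK = C[1:, :] - C[:-1, :]                  # should be <= 0 in strike
    butterfly = C[2:, :] - 2 * C[1:-1, :] + C[:-2, :]  # should be >= 0
    dT = C[:, 1:] - C[:, :-1]                  # should be >= 0 in maturity
    return (torch.relu(dK).sum() + torch.relu(-butterfly).sum()
            + torch.relu(-dT).sum())

K = torch.linspace(0.5, 1.5, 20)   # strike grid
T = torch.linspace(0.1, 2.0, 10)   # maturity grid
penalty = arbitrage_penalty(K, T)  # added to a data-fit loss during training
penalty.backward()
```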